LONG COMMON SUBSEQUENCES AND THE PROXIMITY
OF TWO RANDOM STRINGS


By 

J.  Michael  Steele 

I.  INTRODUCTION 

Long molecules such as proteins and nucleic acids can be thought
of schematically as sequences from a finite alphabet G. From an
evolutionary point of view it is natural to compare molecules by
considering their common ancestors, and in schematic terms this reduces
to the problem of considering the longest common subsequence of two
given sequences.

Sankoff (1972) gave an efficient algorithm for calculating the
length of the longest common subsequence. Subsequently, Sankoff and
Cedergren (1973), and Sankoff, Cedergren, and Lapalme (1976) considered
a number of empirical cases and conducted some Monte Carlo investigations.
The first formal probabilistic analysis of the problem of long common
subsequences was initiated in Chvátal and Sankoff (1975). To describe their
work we first introduce some notation.

By $X_i$ and $X_i'$, $1 \le i < \infty$, we denote two sequences of
independent and identically distributed random variables with values
in G. The random variable of main interest is

$L_n := \max\{k : X_{i_1} = X'_{j_1}, X_{i_2} = X'_{j_2}, \ldots, X_{i_k} = X'_{j_k} \text{ where } 1 \le i_1 < i_2 < \cdots < i_k \le n \text{ and } 1 \le j_1 < j_2 < \cdots < j_k \le n\}$.

In words, $L_n$ is the largest cardinality of any subsequence common to
the sequences $(X_1, X_2, \ldots, X_n)$ and $(X_1', X_2', \ldots, X_n')$.
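For concreteness, the length of the longest common subsequence of two
given sequences can be computed by the standard textbook dynamic
programming recursion. The sketch below is only an illustration of the
definition; the function name `lcs_length` is ours, and no claim is made
that this is the particular algorithm of Sankoff (1972).

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of the sequences x and y."""
    # prev[j] is the LCS length of the prefix of x processed so far and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend a common subsequence of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last letter of one prefix or of the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

The running time is proportional to the product of the two lengths, and
only two rows of the table are kept in memory.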

Under the assumption that $|G| = k$ and that $X_i$ and $X_i'$ are both
uniform on G, Chvátal and Sankoff proved the existence of the limit
of the means,

(1.1)  $\lim_{n \to \infty} E L_n / n = c_k$.
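A rough feeling for the ratio $E L_n / n$ in (1.1) can be had from a
small simulation. The sketch below assumes independent uniform letters
on an alphabet of size k, as in the Chvátal and Sankoff setting; the
names `lcs_length` and `estimate_mean_ratio`, the seed, and the sample
sizes are illustrative choices and not part of the original study.

```python
import random


def lcs_length(x, y):
    """Length of the longest common subsequence (standard dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_mean_ratio(n, k, trials, seed=0):
    """Monte Carlo estimate of E L_n / n for uniform letters on {0, ..., k-1}."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)


ratio_k2 = estimate_mean_ratio(n=100, k=2, trials=20)
ratio_k4 = estimate_mean_ratio(n=100, k=4, trials=20)
```

Such estimates at a fixed n only approximate the limit $c_k$; they say
nothing by themselves about the rate of convergence.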

Among other results, Chvátal and Sankoff obtained upper and lower
bounds on $c_k$. These authors proved no results for $\operatorname{Var} L_n$, but on
the basis of a Monte-Carlo study they were led to conjecture

$\operatorname{Var} L_n = o(n^{2/3})$.

Deken (1979) was able to sharpen the bounds on $c_k$, and also noted
that as a consequence of Kingman's subadditive ergodic theorem (Kingman
(1968)) one actually has

(1.2)  $\lim_{n \to \infty} L_n / n = c$  a.s.,

where $c$ depends on the distributions of the processes
$\{(X_i, X_i') : 1 \le i < \infty\}$. This result naturally entails
$\operatorname{Var} L_n = o(n^2)$, but no further progress was made on the
variance problem.

The present article takes up several aspects of the study of $L_n$.
In the second section, as an elementary application of an inequality of
Efron and Stein (1980), it is proved that $\operatorname{Var} L_n = O(n)$. This makes
only modest progress on the Chvátal-Sankoff conjecture that
$\operatorname{Var} L_n = o(n^{2/3})$, but it still serves to supplement (1.2)
with a rate of convergence result.
The third section takes up the question of the behavior of $L_n$
under more general assumptions than independence. A simple complement
of Kingman's subadditive ergodic theorem (Kingman (1973)) is derived
and then applied to $L_n$. The coupling method which is used here (or
the Radon-Nikodym method which is sketched) may well be of use in
many other problems where subadditivity is available, but stationarity
is absent.

The fourth section branches out from the explicit analysis of $L_n$.
It addresses the question of whether there exist statistics which are
more tractable than $L_n$, but which still reasonably measure the genetic
proximity of long molecules. The principal new candidate is $T_n$, the
total number of common subsequences. Here one can compute $E T_n$ exactly,
but we note that $T_n$ has other drawbacks in its analysis.

The  final  section  makes  brief  comment  on  some  open  problems  and 
related  literature. 


Acknowledgement. The observation that absolute continuity provides a
second proof of Corollary 1 in Section III is due to Steve Lalley, who
kindly commented on an earlier draft of this article.
II. A VARIANCE BOUND

Let $S(v_1, v_2, \ldots, v_{n-1})$ denote any real valued function of n-1
vectors $v_i \in \mathbb{R}^d$, and suppose $V_i$, $1 \le i < \infty$, is any sequence of
independent, identically distributed random vectors in $\mathbb{R}^d$. We
then define new random variables $S_i = S(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n)$
for $1 \le i \le n$, and we further set $S_{\cdot} = \frac{1}{n} \sum_{i=1}^{n} S_i$. Tukey's
jackknife estimate for the variance of S is $\sum_{i=1}^{n} (S_i - S_{\cdot})^2$, and
Efron and Stein (1980) have proved the very useful inequality,

(2.1)  $\operatorname{Var}(S_{\cdot}) \le E \sum_{i=1}^{n} (S_i - S_{\cdot})^2$.

The main point of this section is to show that (2.1) leads to
the bound

(2.2)  $\operatorname{Var} L_n = O(n)$,

under the general assumption that $V_i = (X_i, X_i')$ are independent and
identically distributed. In fact, one can prove the following result.

Theorem 1. For each n, suppose there is defined a function
$S(x_1, x_2, \ldots, x_n)$ from $(\mathbb{R}^d)^n$ to $\mathbb{R}$. Suppose also that $V_i$,
$1 \le i < \infty$, is any sequence of independent random vectors in $\mathbb{R}^d$,
and for $1 \le i \le n$, $1 \le n < \infty$ set

(2.3)  $S_{i,n} = S(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n)$.

If $E(S_{i,n} - S_{j,n})^2$ is bounded for all $1 \le i \le j \le n$ and $1 \le n < \infty$,
then

(2.4)  $\operatorname{Var} S(V_1, V_2, \ldots, V_n) = O(n)$.


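Proof. Let the bound on $E(S_{i,n} - S_{j,n})^2$ be B. Fix n, define
$S_{\cdot} = n^{-1} \sum_{i=1}^{n} S_{i,n}$, and let

(2.5)  $D_n = S(V_1, V_2, \ldots, V_{n-1}) - S_{\cdot}$
       $= \frac{1}{n} \sum_{i=1}^{n} \big( S(V_1, V_2, \ldots, V_{n-1}) - S(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n) \big)$.

By Schwarz' inequality,

(2.6)  $\operatorname{Var} S(V_1, V_2, \ldots, V_{n-1}) \le \operatorname{Var}(S_{\cdot}) + \operatorname{Var} D_n + 2 (\operatorname{Var} S_{\cdot})^{1/2} (\operatorname{Var} D_n)^{1/2}$,

and

(2.7)  $\operatorname{Var} D_n \le E D_n^2 \le B$.

Since $E(S_{i,n} - S_{j,n})^2 \le B$ one also has $E((S_{i,n} - S_{\cdot})^2) \le B$. So
inequalities (2.1), (2.6), and (2.7) entail

(2.8)  $\operatorname{Var} S(V_1, V_2, \ldots, V_{n-1}) \le nB + B + 2 (nB)^{1/2} B^{1/2} = B (n^{1/2} + 1)^2$.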
This completes the proof of the Theorem with a very specific form
of the O(n) term. ∎
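The passage from the pairwise bound B to the bounds used in (2.7) and
(2.8) is an averaging step. One way to spell it out, assuming nothing
beyond the convexity of $x \mapsto x^2$ and the identity
$S(V_1, \ldots, V_{n-1}) = S_{n,n}$ (so that $D_n = S_{n,n} - S_{\cdot}$), is the following.

```latex
S_{i,n} - S_{\cdot} = \frac{1}{n} \sum_{j=1}^{n} \bigl( S_{i,n} - S_{j,n} \bigr)
\quad\Longrightarrow\quad
E\bigl( S_{i,n} - S_{\cdot} \bigr)^2
  \le \frac{1}{n} \sum_{j=1}^{n} E\bigl( S_{i,n} - S_{j,n} \bigr)^2
  \le B ,
\qquad\text{hence}\qquad
E D_n^2 \le B
\quad\text{and}\quad
E \sum_{i=1}^{n} \bigl( S_{i,n} - S_{\cdot} \bigr)^2 \le nB .
```

The last bound combined with (2.1) gives $\operatorname{Var}(S_{\cdot}) \le nB$, which is
the first term on the right side of (2.8).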

Returning to $L_n$, we note that for $V_i = (X_i, X_i')$ and
$G = \{1, 2, \ldots\}$ Theorem 1 is applicable to
$S(V_1, V_2, \ldots, V_n) = L_n(V_1, V_2, \ldots, V_n)$. Since

(2.9)  $0 \le L(V_1, V_2, \ldots, V_n) - L(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n) \le 1$,

it is trivial that (2.8) can be taken with B = 1. In summary we have
the following bound.

Corollary 1. If $(X_i, X_i')$ are i.i.d. with values in $G \times G$ then

(2.10)  $\operatorname{Var} L_{n-1} \le (n^{1/2} + 1)^2$.
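The bound (2.10) can be compared with a simulated variance. The sketch
below assumes the special case of independent uniform letters on a
two-letter alphabet; the helper names, the seed, and the sample sizes
are illustrative assumptions. Since (2.10) is stated for $L_{n-1}$, the
bound applied to $L_n$ uses n + 1 in place of n.

```python
import random
import statistics


def lcs_length(x, y):
    """Length of the longest common subsequence (standard dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sample_lcs_lengths(n, k, trials, seed=0):
    """Independent draws of L_n for uniform letters on {0, ..., k-1}."""
    rng = random.Random(seed)
    return [
        lcs_length([rng.randrange(k) for _ in range(n)],
                   [rng.randrange(k) for _ in range(n)])
        for _ in range(trials)
    ]


n = 50
lengths = sample_lcs_lengths(n, k=2, trials=200)
sample_var = statistics.variance(lengths)   # unbiased sample variance of L_n
bound = ((n + 1) ** 0.5 + 1) ** 2           # right side of (2.10) with n + 1 for n
```

In runs of this kind the sample variance tends to sit well below the
bound, which is consistent with the remark that (2.10) is only modest
progress toward the conjectured order.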

By  the  usual  Borel-Cantelli  and  subsequence  arguments  together  with 
(2.9)  and  (2.10)  one  can  prove  a  rate  result. 

Corollary 2. We have for all $\epsilon > 0$ that

(2.11)  $L_n - E L_n = o(n^{2/3 + \epsilon})$  with probability one.

Since the techniques for proving (2.11) are well-known and since the
result is not the best possible, there is no reason to include the proof.
This is nevertheless the first rate result available on $L_n$, since such
rates cannot be obtained in general from the subadditive ergodic theorem
(cf. Hammersley (1978), p. 670).




III.  NON-STATIONARY  SEQUENCES 

By Deken's observation we know Kingman's theorem implies that $L_n/n$
converges almost surely under the assumption that the $V_i$, $1 \le i < \infty$,
form a stationary sequence. The point of this section is to give a
very simple illustration of how Kingman's theorem can also be used for
non-stationary processes. Naturally, one must appeal to some underlying
asymptotic stationarity, but the resulting class of results seems useful
enough to merit recording. In particular, one should compare the present
result to the "sub-stationary" subadditive ergodic theorem of Abid (1979).
That result apparently does not suffice for the application to $L_n$ given
here, and it is considerably more complicated.

By a subadditive sequence of functions on E we denote a sequence
$h_n : E^n \to \mathbb{R}$ which satisfies

(3.1)  $h_{n+m}(e_1, e_2, \ldots, e_{n+m}) \le h_m(e_1, e_2, \ldots, e_m) + h_n(e_{m+1}, e_{m+2}, \ldots, e_{n+m})$.

As an example, we note that if $E = G \times G$ and $e_i = (a_i, a_i')$, then
letting $h_n(e_1, e_2, \ldots, e_n)$ denote the length of the longest common
subsequence of $(a_1, a_2, \ldots, a_n)$ and $(a_1', a_2', \ldots, a_n')$ one has (3.1).
Because of the applications we have in view, we will also focus on
monotone subadditive functions, i.e. those functions which satisfy
(3.1) as well as

(3.2)  $h_{n-m}(x_{m+1}, x_{m+2}, \ldots, x_n) \le h_n(x_1, x_2, \ldots, x_n)$  for all $m \le n$

and $\{x_1, x_2, \ldots, x_n\}$.


We will say that a stochastic process $\{X_i\}_{i=1}^{\infty}$ on the discrete
state space E has a stationary ergodic coupling if there is a
stationary ergodic process $\{X_i'\}_{i=1}^{\infty}$ on the same probability space
such that $(X_i, X_i')$ is a coupling, i.e. such that the stopping
time $\tau = \min\{i : X_i = X_i'\}$ is finite with probability one.

It  is  well-known  that  couplings  are  a  convenient  and  powerful 
way  of  expressing  the  asymptotic  properties  of  stochastic  processes 
(see  e.g.  Griffeath  (1978)).  The  next  result  illustrates  this  ease 
of  application. 


Theorem 2. Suppose that $h_n$ is a positive and monotone sequence
of subadditive functions on E. If $\{X_i\}_{i=1}^{\infty}$ is a stochastic process
with state space E for which there is a stationary ergodic coupling then

(3.3)  $\lim_{n \to \infty} h_n(X_1, X_2, \ldots, X_n)/n = c$  a.s.

for some constant c.


Proof. Let $\{X_i'\}_{i=1}^{\infty}$ denote the stationary ergodic process to
which $\{X_i\}_{i=1}^{\infty}$ may be coupled, and let $\tau$ be the coupling time, i.e.
$\tau = \min\{i : X_i = X_i'\}$. The doubly indexed process
$Y_{st} = h_{t-s}(X'_{s+1}, X'_{s+2}, \ldots, X'_t)$
is easily checked to have the properties:

(3.4a)  $Y_{su} \le Y_{st} + Y_{tu}$, whenever $s < t < u$.

(3.4b)  The joint distributions of the shifted process $\{Y_{s+1,t+1}\}$
are the same as those of the unshifted process $\{Y_{st}\}$.

and

(3.4c)  The expectations satisfy $E Y_{0t} \ge -At$ for some
A and for all t.

The properties (3.4a-c) are exactly the hypotheses of Kingman's
theorem (Kingman (1973)), so by its conclusion we have

(3.5)  $\lim_{n \to \infty} Y_{0n}/n = \lim_{n \to \infty} h_n(X_1', X_2', \ldots, X_n')/n = c$  a.s.

Here, to conclude that the limit is indeed a constant we have made
use of the fact that Kingman's theorem assures that the limit is shift
invariant and we have assumed that $\{X_i'\}_{i=1}^{\infty}$ is ergodic.

Now we have by (3.1), (3.2), and the definition of $\tau$ that

(3.6)  $h_n(X_1, X_2, \ldots, X_n) \le h_\tau(X_1, X_2, \ldots, X_\tau) + h_{n-\tau}(X'_{\tau+1}, X'_{\tau+2}, \ldots, X'_n)$
       $\le h_\tau(X_1, X_2, \ldots, X_\tau) + h_n(X_1', X_2', \ldots, X_n')$.

Since $\tau < \infty$ with probability one, (3.5) and (3.6) yield

(3.7)  $\limsup_{n \to \infty} h_n(X_1, X_2, \ldots, X_n)/n \le c$.

To handle the limit infimum we need only consider the analogous inequality
with the variables reversed, i.e.

$h_n(X_1', X_2', \ldots, X_n') \le h_\tau(X_1', X_2', \ldots, X_\tau') + h_n(X_1, X_2, \ldots, X_n)$,

and we obtain

$c \le \liminf_{n \to \infty} h_n(X_1, X_2, \ldots, X_n)/n$

to complete the proof. ∎

Corollary 1. If $V_i$, $1 \le i < \infty$, is an irreducible, aperiodic,
positive recurrent Markov chain with state space $G \times G$ then no matter
what the initial distribution $\pi(v) = P(V_1 = v)$, one has with probability
one

$\lim_{n \to \infty} L_n(V_1, V_2, \ldots, V_n)/n = c$

for some constant c.

To prove the corollary one only has to exhibit an appropriate
coupling; and, in this case, the existence of such a coupling is
well-known (see e.g. Hoel, Port, and Stone (1972)).
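To make the notion of coupling concrete, here is a small simulation of
the classical construction in which two copies of a finite Markov chain
run independently until they first agree and move together afterwards.
This is a sketch under stated assumptions: the states 0, 1, 2 stand in
for elements of $G \times G$, the transition matrix `P` is a made-up doubly
stochastic example (so the uniform distribution is stationary), and the
function names are ours. It is not claimed to be the exact construction
given by Hoel, Port, and Stone.

```python
import random


def step(state, P, rng):
    """Draw the next state from row P[state] of the transition matrix."""
    return rng.choices(range(len(P)), weights=P[state])[0]


def coupled_paths(P, start, stationary, n, seed=0):
    """Simulate V_1..V_n from `start` and V'_1..V'_n from `stationary`.

    The two chains move independently until the first time tau at which
    they agree (times are 1-based), and move together from then on.
    Returns (path_v, path_w, tau); tau is None if no meeting occurs by time n.
    """
    rng = random.Random(seed)
    v = start
    w = rng.choices(range(len(P)), weights=stationary)[0]
    path_v, path_w = [v], [w]
    tau = 1 if v == w else None
    for i in range(2, n + 1):
        if tau is None:
            v, w = step(v, P, rng), step(w, P, rng)
            if v == w:
                tau = i
        else:
            v = w = step(v, P, rng)
        path_v.append(v)
        path_w.append(w)
    return path_v, path_w, tau


# Hypothetical irreducible, aperiodic chain; uniform is stationary because
# the matrix is doubly stochastic.
P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
path_v, path_w, tau = coupled_paths(P, start=0, stationary=[1 / 3] * 3, n=200, seed=1)
```

After time tau the two paths coincide, which is the property used in
the proof of Theorem 2 to transfer the limit from the stationary
process to the non-stationary one.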

One can also prove the above corollary without recourse to coupling;
one can use an absolute continuity argument. Under the hypotheses of
the corollary there is a stationary measure $\pi'$. Moreover, the initial
measure $\pi$ is absolutely continuous with respect to $\pi'$ (since by the
irreducibility and positive recurrence $\pi'(a_1, a_2) > 0$ for all
$(a_1, a_2) \in G \times G$). If $\{V_i' : 1 \le i < \infty\}$ is the process with initial
distribution $\pi'$ and the same transition function as $\{V_i : 1 \le i < \infty\}$,
it is further true that the measure $p$ for the infinite process
$\{V_i : 1 \le i < \infty\}$ is absolutely continuous with respect to that $p'$ for
$\{V_i' : 1 \le i < \infty\}$. Since $L(V_1', V_2', \ldots, V_n')$ satisfies the hypotheses
of Kingman's subadditive ergodic theorem,
$\{\omega : \lim_{n \to \infty} L(V_1', V_2', \ldots, V_n')/n = c\}$ is a set of $p'$
measure one. By absolute continuity $p \ll p'$ the set
$\{\omega : \lim_{n \to \infty} L(V_1, V_2, \ldots, V_n)/n = c\}$ has $p$ measure one. This is
precisely the conclusion of the corollary.
IV. ALTERNATIVE STATISTICS

The random variable $L(V_1, V_2, \ldots, V_n)$ certainly is an interesting
measure of genetic proximity, but it appears to be hard to handle. In
such a situation it is natural to look for suitable alternatives.

To introduce one such alternative let $(X_1, X_2, \ldots, X_n)$ and
$(X_1', X_2', \ldots, X_n')$ denote two sequences of values from G. By A, B
we denote subsets of $\{1, 2, \ldots, n\}$, say $A = \{i_1, i_2, \ldots, i_k\}$ and
$B = \{j_1, j_2, \ldots, j_k\}$ if $|A| = |B| = k$, with the elements of each set
listed in increasing order. Next we set

(4.1)  $\rho(A, B) = 1$ or $0$ according as $X_{i_1} = X'_{j_1}, X_{i_2} = X'_{j_2}, \ldots, X_{i_k} = X'_{j_k}$, or not.

The statistic of interest in this section is

(4.2)  $T_n = \sum_{A,B} \rho(A, B)$,

where the sum is over all pairs of subsets of $\{1, 2, \ldots, n\}$ and it is
understood that $\rho(A, B)$ is taken to be zero if the cardinalities of
A and B differ, i.e. $|A| \ne |B|$.

If the $X_i$, $1 \le i < \infty$, and the $X_i'$, $1 \le i < \infty$, are all independent,
and $P(X_i = a_j) = p_j$, $P(X_i' = a_j) = p_j'$ for all i, j, it is easy to
see that

(4.3)  $E T_n = \sum_{k=0}^{n} \binom{n}{k}^2 \Big( \sum_{j \ge 1} p_j p_j' \Big)^k$.
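Formula (4.3) can be checked exactly on a tiny case by enumerating
every pair of sequences together with every pair of index sets. The
sketch below uses exact rational arithmetic; the names
`expected_T_formula`, `count_T`, and `expected_T_bruteforce` are ours,
`q` stands for the vector of $p_j'$, and the enumeration is exponential in
n, so it is meant only as a sanity check and not as a way to compute
$T_n$ in practice.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb, prod


def expected_T_formula(n, p, q):
    """Right side of (4.3): sum over k of C(n,k)^2 * (sum_j p_j q_j)^k."""
    r = sum(pj * qj for pj, qj in zip(p, q))
    return sum(comb(n, k) ** 2 * r ** k for k in range(n + 1))


def count_T(x, y):
    """T_n for fixed sequences: pairs (A, B) of equal size matching in order."""
    n = len(x)
    total = 0
    for k in range(n + 1):
        for A in combinations(range(n), k):      # indices come out increasing
            for B in combinations(range(n), k):
                if all(x[i] == y[j] for i, j in zip(A, B)):
                    total += 1
    return total


def expected_T_bruteforce(n, p, q):
    """E T_n by summing count_T over all sequence pairs, weighted by probability."""
    letters = range(len(p))
    total = Fraction(0)
    for x in product(letters, repeat=n):
        px = prod(p[a] for a in x)
        for y in product(letters, repeat=n):
            py = prod(q[b] for b in y)
            total += px * py * count_T(x, y)
    return total


p = [Fraction(1, 3), Fraction(2, 3)]
q = [Fraction(1, 2), Fraction(1, 2)]
formula_value = expected_T_formula(3, p, q)
bruteforce_value = expected_T_bruteforce(3, p, q)
```

The agreement reflects linearity of expectation: each of the
$\binom{n}{k}^2$ pairs (A, B) of size k matches with probability
$(\sum_j p_j p_j')^k$.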


This explicit formula is quite a contrast to the mystery surrounding
$E L_n$ under similar hypotheses. A number of qualitative properties of $E T_n$
are also evident from (4.3). In particular, if we set $p_j = p_j'$ for
$1 \le j \le |G| = a$ and take G finite, then

(4.4)  $\phi(p) = E_p T_n = \sum_{k=0}^{n} \binom{n}{k}^2 \Big( \sum_{j=1}^{a} p_j^2 \Big)^k$

is easily checked to be a Schur-convex function, i.e. $\phi(p) \le \phi(p')$
whenever p is majorized by p'. (For an elaboration of this terminology
see Hardy, Littlewood, and Pólya (1951), and for an elaboration of the
many consequences of Schur-convexity see the treatise by Olkin and Marshall
(1979)).
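The Schur-convexity of $\phi$ can be spot-checked numerically along a chain
of distributions ordered by majorization. The sketch below is not a
proof; it only evaluates (4.4) at three hypothetical probability
vectors on a three-letter alphabet, using exact fractions. The reason
the ordering comes out as it does is that $\phi$ is an increasing function
of $\sum_j p_j^2$, which is itself Schur-convex.

```python
from fractions import Fraction
from math import comb


def phi(p, n):
    """Evaluate (4.4): sum over k of C(n,k)^2 * (sum_j p_j^2)^k."""
    s = sum(pj * pj for pj in p)
    return sum(comb(n, k) ** 2 * s ** k for k in range(n + 1))


# Each vector is majorized by the next one in the list.
uniform = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
skewed = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
point = [Fraction(1), Fraction(0), Fraction(0)]

values = [phi(p, 5) for p in (uniform, skewed, point)]
```

For the point mass every pair of equal-size index sets matches, so the
last value is $\sum_k \binom{5}{k}^2 = \binom{10}{5}$.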

Despite the mathematical simplicity of $T_n$ as evidenced by (4.3)
and (4.4), it provides only a partial surrogate for $L_n$. In the first
place $T_n$ tends to be very large, and there is no efficient algorithm
for finding $T_n$. Thus, from a computational viewpoint, $L_n$ is a superior
statistic. Also, as of yet, there is no information at all about the
variance of $T_n$ or of its limit properties.


V.  OPEN  PROBLEMS 


The main open problems concern the expectations

(5.1)  $\psi_n(p) = E L_n$

under the hypotheses of independence and identical distribution as applied
in (4.4).

For one explicit conjecture, it seems inevitable that $\psi_n(p)$ is
Schur convex (just as $\phi(p)$ was proved to be). Perhaps it would be
easier to consider the limit,

(5.2)  $\psi(p) = \lim_{n \to \infty} \psi_n(p)/n$.

Again, it must be true that $\psi(p)$ is Schur convex, but so far even this
has not been proved.

The older problems concern the numerical value of $\psi(p)$. Perhaps
progress can be made on this problem by taking a more algorithmic point
of view. Is there an efficient algorithm for computing the approximate
value of $\psi(p)$ or $\psi_n(p)$ with a guaranteed error bound?

Given the results of Section II, it is very interesting to see if one
can improve (2.10) to show $\operatorname{Var} L_n = o(n)$. This would be the first
really non-trivial step toward the Chvátal-Sankoff conjecture, and it
would seem to require some genuinely new combinatorial insight to settle
the point one way or the other.

Finally, the main scientific problem is to find a replacement for
$L_n$ which still has a genetic justification. The null distributions of
$L_n$ seem like they will always be out of reach, and major progress will
be made when $L_n$ finds a suitable substitute. The statistic $T_n$ is a
reasonable first choice, but it leads to its own problems. For example,
what is the order of the growth of $\operatorname{Var} T_n$?

In the search for surrogates for $L_n$, it may be critical to
consider the variety of problems to which it has been applied. In
addition to the application to molecule comparisons noted previously, there
is a natural application in communications. In particular, Bradley and
Bradley (1978) have applied $L_n$ in the study of bird songs.

There are also a variety of potential uses in computer science, and
for an introduction there it seems useful to refer to the papers of
Aho, Hirschberg, and Ullman (1976), Okuda, Tanaka, and Kasai (1975),
Selkow (1977), and Wagner and Fischer (1974). In at least some of these
papers in which $L_n$ has been used, it seems there must exist a more
tractable substitute.